Parallel processing array



The invention relates to a parallel processing array, for example a single
instruction multiple data (SIMD) processing array, having a plurality of processing elements
(PEs). In particular, the present invention relates to a parallel processing array which is
adapted to improve the efficiency of the array when handling data dependent processing
operations.

In a SIMD processing array, each processing element (PE) in the array
receives the same instruction via a common instruction stream, and executes the instruction
based on local data which is unique to that processing element. Parallel processing arrays are
therefore well suited to performing highly repetitive tasks where the same operations are
performed on multiple data at the same time, such as those associated with image processing.
SIMD therefore provides an area efficient, scalable, low power implementation. While SIMD
is suited to applications which involve significant repetition in the data and the processing of
the data, SIMD is not so suited for performing data dependent processing operations.

For example, in video processing (e.g. deinterlacing, noise reduction,
horizontal dynamic peaking) most of the operations are exactly the same for all data elements
in the array, and therefore make efficient use of the SIMD array. However, data dependent
processing operations, such as look-up table operations or multiplication with different
coefficients based on the location of data in an array, do not make efficient use of a SIMD
processing array.

Fig. 1 shows a schematic of a typical processing element (PE) 1 in a Xetal
SIMD processing architecture (Xetal being a low power parallel processor for digital video
cameras). The processing element 1 comprises an arithmetic logic unit (ALU) 3, a
multiplexer (MUX) 5, an accumulator (ACCU) 7 and a flag register (FLAG) 9. The
processing element 1 receives a broadcast instruction 10, which will be received by all other
processing elements in the array (not shown). The ALU 3 processes the instruction 10 based
on local data. The accumulator 7 is provided for storing a last result, which can be used as the
operand for the next instruction 10. Typically, the ALU 3 comprises an adder and a
multiplier, thereby enabling comparisons, addition, subtraction, data weighting and
multiply-accumulate operations to be performed within one clock cycle. Typically, the flag
register 9 contains a 1-bit flag which is set according to the last result. Based on this flag
status, conditional pass-instructions are possible, allowing a limited form of data dependency
in the algorithms.

It is noted that the multiplexer 5 is controlled by the broadcast instruction 10.
In the Xetal architecture, the multiplexer 5 receives a number of input signals which are
selectively connected to the ALU 3. For example, the multiplexer 5 receives data from part of
the line memory 6, and data from left and right communication channels 8, 12. The
multiplexer 5 also receives coefficient data (coeff) 16. In this manner, the ACCU signal 14
from the accumulator 7 is used as one operand for an operation, and the multiplexer 5 selects
the second operand. Therefore, the second operand can be selected from the left
communication channel 8, the right communication channel 12 or the line memory 6, or a
"fixed" number from the coefficient input 16.
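The operand selection described above can be pictured with a small software model. The following Python sketch is illustrative only: the class name `ConventionalPE`, the operation mnemonics, the source names and the rule used to set the flag are assumptions made for this example and are not taken from the Xetal instruction set.

```python
from dataclasses import dataclass


@dataclass
class ConventionalPE:
    """Toy model of the processing element of Fig. 1 (names are hypothetical)."""
    line_memory: int = 0   # local data from the line memory 6
    left: int = 0          # value on the left communication channel 8
    right: int = 0         # value on the right communication channel 12
    accu: int = 0          # accumulator 7: last result, used as first operand
    flag: bool = False     # flag register 9: 1-bit status of the last result

    def step(self, op: str, src: str, coeff: int = 0) -> None:
        # Multiplexer 5: the broadcast instruction selects the second operand.
        operand = {"mem": self.line_memory, "left": self.left,
                   "right": self.right, "coeff": coeff}[src]
        # ALU 3: a few representative operations.
        if op == "load":
            result = operand
        elif op == "add":
            result = self.accu + operand
        elif op == "mul":
            result = self.accu * operand
        elif op == "max":
            result = max(self.accu, operand)
        else:
            raise ValueError(f"unknown operation: {op}")
        # Assumption: the flag records whether the last result was negative.
        self.flag = result < 0
        self.accu = result


def broadcast(pes, op, src, coeff=0):
    """Every PE receives and executes the same instruction on its own data."""
    for pe in pes:
        pe.step(op, src, coeff)


pes = [ConventionalPE(line_memory=v) for v in (1, 2, 3)]
broadcast(pes, "load", "mem")             # accu = local line-memory value
broadcast(pes, "mul", "coeff", coeff=10)  # same "fixed" coefficient for all PEs
result = [pe.accu for pe in pes]
```

In this model the only per-PE variation comes from the local data; the coefficient is identical for every PE, which is the limitation discussed in the following paragraphs.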

In this kind of processor, performing data dependent processing, for example
retrieving a value from a look-up table or performing different operations with different data
elements of the same array, is either not possible or requires many long and complicated
iterations. This results in poor efficiency of the SIMD processing array.

For example, performing a multiplication after retrieving a value from a look-up
table of ten elements requires forty operations in the Xetal architecture shown in Fig. 1.
An implementation using the Xetal instruction set is shown below, in which, for a look-up,
the value has to be compared to a lower limit:

Assume r0 has the data elements of a desired array.
1. accu = r0; -- move the data to the accumulator
2. accu = MAX(accu, lower_limit0); -- find max of r0 and lower_limit0 and store in accu
3. accu = accu*coeff0; -- this is the operation in the interval
4. r1 = PASSC(accu, r1); -- if r0 was in the region keep the result, otherwise copy
   r1 to the next interval

As shown above, for all entries in the look-up table (LUT), for example ten in
the example provided above, all four of the above operations must be performed. This means
that for a ten element LUT, forty operations are required.
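The operation count can be reproduced with a loose software model of the four-step sequence. The Python sketch below is an illustration under stated assumptions: the function name, the use of lists to stand for the PE array, and the exact behaviour attributed to MAX and PASSC (a flag set when the data is at or above the lower limit, followed by a conditional pass) are guesses made for the example, not documented Xetal semantics.

```python
def lut_multiply_conventional(r0, lower_limits, coeffs):
    """Multiply each element by the coefficient of its interval, using only
    broadcast operations. Returns (results, number_of_broadcast_operations).

    Assumes lower_limits is sorted in ascending order, so that a later
    interval overwrites the result of an earlier one when the data qualifies.
    """
    r1 = [0] * len(r0)
    ops = 0
    for limit, coeff in zip(lower_limits, coeffs):
        accu = list(r0)                            # 1. accu = r0
        flags = [a >= limit for a in accu]         # flag set during step 2 (assumed)
        accu = [max(a, limit) for a in accu]       # 2. accu = MAX(accu, lower_limit)
        accu = [a * coeff for a in accu]           # 3. accu = accu * coeff
        r1 = [a if f else old                      # 4. r1 = PASSC(accu, r1)
              for a, f, old in zip(accu, flags, r1)]
        ops += 4                                   # four instructions per LUT entry
    return r1, ops


lower_limits = list(range(0, 100, 10))   # ten intervals: 0, 10, ..., 90
coeffs = list(range(1, 11))              # one coefficient per interval
results, ops = lut_multiply_conventional([3, 12, 95], lower_limits, coeffs)
```

Every PE executes all four steps for every table entry, whether or not its own data falls in that interval, so the count grows linearly with the table size (here 4 x 10 = 40).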
The aim of the present invention is to provide a parallel processing array that
does not suffer from the disadvantages mentioned above, when processing data dependent
operations.



According to a first aspect of the present invention, there is provided a parallel
processing array comprising a plurality of processing elements (PEs), each processing
element receiving a common instruction and comprising:

a multiplexer for receiving said common instruction;
an arithmetic logic unit, connected to said multiplexer, for processing the
received instruction in association with an accumulator and a flag register;
characterized in that one or more of the processing elements in the processing array further
comprises a storage element having at least one storage location, the storage element
configured to be indirectly addressable by the received instruction, thereby enabling the
processing of data dependent operations to be performed.

The processing array defined above has the advantage of being more efficient
than a conventional processing array when performing data dependent processing operations.

According to another aspect of the present invention, there is provided a
method of processing data in a parallel processing array comprising a plurality of processing
elements (PEs), each processing element receiving a common instruction and comprising a
multiplexer for receiving said common instruction, and an arithmetic logic unit, connected to
said multiplexer, for processing the received instruction in association with an accumulator
and a flag register, the method comprising the steps of:

providing a storage element in one or more of the processing elements in the
processing array, the storage element having at least one storage location;

configuring the storage element to be indirectly addressable by the received
instruction; and

processing data dependent operations using the storage element.






For a better understanding of the present invention, and to show more clearly
how it may be carried into effect, reference will now be made, by way of example, to the
accompanying drawings, in which:
Fig. 1 shows a schematic diagram of a processing element of a parallel
processing array according to the prior art; and

Fig. 2 shows a schematic diagram of a processing element of a parallel
processing array according to the present invention.



Fig. 2 shows a processing element 1 of a processing array according to the
present invention. As shown in Fig. 1 above, the processing element 1 comprises an
arithmetic logic unit (ALU) 3, a multiplexer (MUX) 5, an accumulator (ACCU) 7 and a flag
register (FLAG) 9. The operation of these elements for "conventional" processing, i.e.
non-data dependent processing, is the same as that described above in relation to Fig. 1.

According to the invention, the processing element further comprises a storage
element (SE) 11, which supports the processing of local customized (i.e. data dependent)
processing in the processing element 1.

The storage element 11 comprises a number of storage locations SE1 to SEn.
The number of storage locations is selected in the design process depending on the
particular application, and can be any integer value. The storage element 11 receives input
data 13 (data_in) via a multiplexer 15. The multiplexer 15 is connected to receive
accumulator data 14 from the output of the accumulator 7, and coefficient data 16 (coeff)
from a coefficient port of the processing element 1. The multiplexer 15 is arranged to
selectively provide the accumulator data 14 or coefficient data 16 as the input data 13 to the
storage element 11, under control of a control signal 17, which comes from or forms part of
the broadcast instruction 10.

The storage element 11 also receives an index signal 19, which is connected to
the output of a multiplexer 21. The multiplexer 21 is also connected to receive accumulator
data 14 from the accumulator 7, and coefficient data 16 (coeff) from the coefficient port of
the processing element 1. The multiplexer 21 is also controlled by the control signal 17,
which comes from or forms part of the broadcast instruction 10. Output data 22 (data_out)
from the storage element 11 is connected to the input of the multiplexer 5 of the processing
element 1. Preferably, a register 23 (curr_se) is provided between the output of the storage
element 11 and the multiplexer 5, which can be used for storing a value of the storage
element 11, as will be described in greater detail later in the application.
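The data path around the storage element can be modelled in a few lines. The following Python sketch is illustrative only: the class name `StorageElement`, the string-valued selects standing in for the control signal 17 (with one select per multiplexer), and the zero-based indexing are assumptions made for the example, not part of the described hardware.

```python
class StorageElement:
    """Toy model of storage element 11 with multiplexers 15 and 21 (Fig. 2)."""

    def __init__(self, n_locations: int) -> None:
        self.locations = [0] * n_locations   # storage locations SE1 .. SEn
        self.curr_se = 0                     # optional register 23 (curr_se)

    @staticmethod
    def _mux(select: str, accu: int, coeff: int) -> int:
        # Both multiplexers choose between accumulator data 14 and
        # coefficient data 16 under control of the broadcast instruction.
        if select == "accu":
            return accu
        if select == "coeff":
            return coeff
        raise ValueError(f"unknown select: {select}")

    def store(self, accu: int, coeff: int, data_sel: str, index_sel: str) -> None:
        data_in = self._mux(data_sel, accu, coeff)   # multiplexer 15 -> data_in 13
        index = self._mux(index_sel, accu, coeff)    # multiplexer 21 -> index 19
        self.locations[index] = data_in

    def load(self, accu: int, coeff: int, index_sel: str) -> int:
        index = self._mux(index_sel, accu, coeff)    # multiplexer 21 -> index 19
        self.curr_se = self.locations[index]         # data_out 22, latched in register 23
        return self.curr_se


se = StorageElement(4)
se.store(accu=7, coeff=2, data_sel="accu", index_sel="coeff")  # location 2 <- 7
value = se.load(accu=0, coeff=2, index_sel="coeff")
```

Because the index is taken from data rather than from a fixed address in the instruction, two PEs receiving the same broadcast instruction can read or write different locations, which is what indirect addressing means here.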

Next, the operation of the embodiment of Fig. 2 will be described in relation to
storing different coefficients in each PE to provide multiplication (or any other operation)
with a different coefficient per data in the array, and the performing of a look-up table
operation, though the instruction used for all PEs remains the same.

When the storage element 11 is used for storing different coefficients, the
coefficient data 16 is used as an index for the storage element table, and the accumulator data
14 from the accumulator 7 or line memory is stored at a corresponding storage location SE
in the storage element 11. In other words, the multiplexer 15 is controlled to pass the
accumulator data 14 as the input data 13 to the storage element 11, while the multiplexer 21
is controlled to pass the coefficient data 16, which acts as the index 19 to the respective
storage location SEy in which the data is to be stored. This enables the correct values to be
stored in the correct locations SE1-SEn in the storage element 11. Alternatively, if desired,
the coefficients can be stored by applying the coefficient data to the input 13 and using the
accumulator data as the index 19.

When loading a value from the storage element 11, in a similar manner to the
above, the coefficient data 16 is used as an index 19 to make a value from the respective
storage location of the storage element 11 available at the output, thereby enabling
multiplication with the data in the accumulator 7 to be performed. In other words, when
loading data from the storage element 11, the multiplexer 21 is arranged to pass the
coefficient data 16 as the index 19 for the storage element 11, thereby outputting the
respective output data 22 from the storage element 11. The output data from the storage
element 11 is passed via the multiplexer 5 to the ALU 3, for multiplication with the data from
the accumulator 7.
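The store-then-multiply sequence for per-PE coefficients might be sketched as follows. This is a simplified Python illustration, with the array of PEs represented by plain lists; the function names and the example weights are hypothetical.

```python
def store_coefficients(tables, accus, index):
    """Broadcast 'store': each PE writes its own accumulator value to the
    location selected by the broadcast coefficient data (used as index 19)."""
    for table, accu in zip(tables, accus):
        table[index] = accu


def multiply_by_stored(tables, accus, index):
    """Broadcast 'multiply': each PE multiplies its accumulator by the value
    held at the same index of its own storage element."""
    return [accu * table[index] for table, accu in zip(tables, accus)]


n_pes, n_locations = 4, 2
tables = [[0] * n_locations for _ in range(n_pes)]  # one storage element per PE

# Per-PE weights, assumed to have been moved from line memory into the accumulators.
store_coefficients(tables, accus=[1, 2, 3, 4], index=0)

pixels = [10, 10, 10, 10]                            # data now in the accumulators
weighted = multiply_by_stored(tables, pixels, index=0)
```

The index is the same for every PE, but the value found at that index differs per PE, so one broadcast multiply applies a position-dependent weight.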

When using the storage element as a look-up table (LUT) in each PE, there are
a number of alternative approaches for storing the correct values (lower_limit, resultant
value) in the storage element 11.

One approach is to use a part of the coeff input as the index, and the other part
as the value to be stored. In other words, part of the coefficient data 16 is passed by the
multiplexer 21 as the index 19 for the storage element 11, while the other part of the
coefficient data 16 is passed by multiplexer 15 as the value to be stored. Although this
method has the disadvantage of increasing the width of the coefficient data signal, it has the
advantage of storing values in one cycle.
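Splitting the coefficient word into an index part and a value part could look like the sketch below. The field widths (`INDEX_BITS`, `VALUE_BITS`), the choice of placing the index in the upper bits, and the function names are all assumptions made for illustration; the text does not specify a layout.

```python
INDEX_BITS = 4   # hypothetical width of the index field
VALUE_BITS = 8   # hypothetical width of the value field


def pack(index: int, value: int) -> int:
    """Build a widened coefficient word: index in the upper bits, value below."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index out of range")
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("value out of range")
    return (index << VALUE_BITS) | value


def store_from_coeff_word(table, word: int) -> None:
    """Single-cycle store: one part addresses the table, the other is stored."""
    index = word >> VALUE_BITS                 # part passed by multiplexer 21
    value = word & ((1 << VALUE_BITS) - 1)     # part passed by multiplexer 15
    table[index] = value


table = [0] * (1 << INDEX_BITS)
for i, v in enumerate([5, 9, 200]):
    store_from_coeff_word(table, pack(i, v))
```

The wider word is the cost mentioned above: here the coefficient signal needs INDEX_BITS + VALUE_BITS bits instead of VALUE_BITS, in exchange for filling one table entry per broadcast.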

Another approach is to generate the address or index 19 with the help of the
accumulator 7 and/or the ALU 3, which enables generation of different addresses and in turn
enables the same values to be stored at different locations in storage elements of PEs, by
applying that address to the storage element 11 with a store instruction, such that the value of
the coefficient data 16 is stored in the respective storage location SEy of the storage element
11. In this arrangement, the multiplexer 15 is arranged to pass the coefficient data 16, while
the multiplexer 21 provides an index 19 which has been generated by the accumulator 7
and/or ALU 3. This approach has the advantage of requiring a narrower coefficient data
signal, but has the disadvantage of requiring one extra processing operation.

For loading a value from the storage element 11, the value of the accumulator
7 is used as the index 19, and the looked up value from the storage element 11 is stored in the
register 23 (curr_se) for further use. In other words, the accumulator data 14 from the
accumulator 7 is passed by the multiplexer 21 to provide an index 19 for the storage element
11. The corresponding value from the respective storage location SEy forms the output data
22 of the storage element 11, and is either passed directly to the multiplexer 5, or stored in
the register 23 for later use. It is noted that, when running at low frequency, the register 23
(curr_se) can be bypassed to perform operations in one cycle.
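A look-up followed by a multiplication can then be expressed as two broadcast steps. The Python sketch below is a simplified illustration that assumes the accumulator value is directly usable as a table index and that one table copy exists per PE; the function name and the example table are hypothetical.

```python
def lut_multiply(tables, accus):
    """Look up a per-PE value using the accumulator as index, then multiply.

    Returns (results, number_of_broadcast_operations).
    """
    ops = 0
    # Step 1: index 19 = accumulator data; looked-up value latched in curr_se.
    curr_se = [table[a] for table, a in zip(tables, accus)]
    ops += 1
    # Step 2: ALU multiplies the accumulator by curr_se (selected via multiplexer 5).
    results = [a * c for a, c in zip(accus, curr_se)]
    ops += 1
    return results, ops


lut = [1, 2, 3, 4]                   # same table loaded into every PE
accus = [0, 3, 1]                    # data dependent indices, one per PE
tables = [list(lut) for _ in accus]
results, ops = lut_multiply(tables, accus)
```

The operation count no longer depends on the number of table entries, in contrast to the four instructions per entry needed in the sequence described for Fig. 1.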

The invention described above provides an improved processor array, since it
enables the processing elements to operate in any of the following ways:

a) every PE executes the same operation based on a broadcast instruction (i.e.
"normal" operation),

b) a PE uses a different coefficient based on the data to be processed to execute
the same broadcast instruction, or

c) the PE executes a function that has been described in a look-up table,
broadcast for all PEs to do LUT operations.

In a video processing application, for example, most of the functions can be
performed on line based processing (for example, deinterlacing, noise reduction, horizontal
dynamic peaking) or can be expressed in terms of line based processing (for example, an
upconverter with 2x2 block size can be processed on line basis by accumulating 2x2 blocks
as two lines and performing line based processing). This means that most operations are
exactly the same for all data elements in the array, and can therefore be performed using the
"normal" PE operation described above in (a).

However, when performing tasks such as multiplication (or other operations)
with different coefficients based on location of data in an array, the processing element
according to the present invention is configured to operate as described in (b) above.

Likewise, for LUT operation, the processing element is configured to operate 
as described in (c) above. 
The invention has the advantage of utilizing the properties of SIMD
processing, yet providing a more efficient operation when data dependent processing
operations are to be performed. For example, the processing of a multiplication with a looked
up value in the present invention would take around two operations (depending on the
implementation choice for LUT operation), which is only about 5% of the earlier described
method, which needed forty instructions.

The invention therefore provides an indirectly addressable memory per
processing element, which can be used for data dependent operations such as look-up table
operations and accessing different coefficients that can be used with the same instruction for
all PEs.

It is noted that the increased area required for a processing element due to the
storage element is not a detrimental factor, since the interconnection area is the dominating
factor on a chip like this. Thus, placing the storage elements near the arithmetic logic unit
(ALU) avoids further overloading of the communication network (i.e. wiring overhead).

Although the preferred embodiment has been described in relation to video
processing, it will be appreciated by a person skilled in the art that the processing element
according to the present invention can also be used for other functions.

Furthermore, although the preferred embodiment has been described in
relation to the Xetal architecture, the invention is equally applicable to other forms of parallel
processing architecture.
CLAIMS:






1. A parallel processing array comprising a plurality of processing elements
(PEs), each processing element receiving a common instruction and comprising:

a multiplexer for receiving said common instruction;

an arithmetic logic unit, connected to said multiplexer, for processing the
received instruction in association with an accumulator and a flag register;
characterized in that one or more of the processing elements in the processing array further
comprises a storage element having at least one storage location, the storage element
configured to be indirectly addressable by the received instruction, thereby enabling the
processing of data dependent operations to be performed.



2. A parallel processing array as claimed in claim 1, wherein the storage element
comprises:

an input data port for receiving data to be stored;

an index signal for addressing a storage location in the storage element; and

an output port for outputting data from the storage element.

3. A parallel processing array as claimed in claim 2, wherein the input data port
of the storage element is connected to receive data from an input multiplexer, the input
multiplexer being configured to pass accumulator data or coefficient data.



4. A parallel processing array as claimed in claim 2 or 3, wherein the index
signal is received from an index multiplexer, the index multiplexer being configured to
selectively pass accumulator data or coefficient data, or part of the received instruction.






5. A parallel processor as claimed in claim 3 or 4, wherein the input multiplexer
and/or index multiplexer is controlled by the received instruction.
6. A parallel processing array as claimed in any one of the preceding claims,
wherein the storage element is configured to provide the processing element with a
coefficient based on the data to be processed.

7. A parallel processing array as claimed in claim 6, wherein the input 
multiplexer is configured to pass accumulator data to the storage element when storing 
coefficient data, the coefficient data being stored in a storage location defined by the index 
signal. 

8. A parallel processing array as claimed in claim 6, wherein the input
multiplexer is configured to pass coefficient data to the storage element, stored in a storage
location defined by the index signal.

9. A parallel processing array as claimed in claim 7 or 8, wherein the index
signal is defined by coefficient data received by the index multiplexer.

10. A parallel processing array as claimed in claim 7 or 8, wherein the index 
signal is defined by accumulator data received by the index multiplexer. 

11. A parallel processing array as claimed in any one of the preceding claims,
wherein the storage element is configured to provide a local look-up table for the processing
element.

12. A parallel processing array as claimed in claim 11, wherein the input
multiplexer is configured to pass coefficient data to the storage element for storage in a
location defined by the index signal.

13. A parallel processing array as claimed in claim 12, wherein the index signal is
defined by accumulator data received by the index multiplexer.

14. A parallel processing array as claimed in claim 11, wherein the input
multiplexer is configured to pass a first part of the coefficient data as the data to be stored in
the storage element, and the index multiplexer is arranged to pass the other part of the
coefficient data as the index signal defining the storage address.



15. A parallel processing array as claimed in any one of the preceding claims,
further comprising a register for storing data between the output of the storage element and
the input of the multiplexer.

16. A parallel processing array as claimed in any one of the preceding claims,
wherein the processing array is a single instruction multiple data (SIMD) processing array.

17. A method of processing data in a parallel processing array comprising a
plurality of processing elements (PEs), each processing element receiving a common
instruction and comprising a multiplexer for receiving said common instruction, and an
arithmetic logic unit, connected to said multiplexer, for processing the received instruction in
association with an accumulator and a flag register, the method comprising the steps of:

providing a storage element in one or more of the processing elements in the
processing array, the storage element having at least one storage location;

configuring the storage element to be indirectly addressable by the received
instruction; and

processing data dependent operations using the storage element.

18. A method as claimed in claim 17, further comprising the steps of:

providing an input data port in the storage element for receiving data to be
stored;

providing an index signal for addressing a storage location in the storage
element; and

providing an output port for outputting data from the storage element.

19. A method as claimed in claim 18, further comprising the steps of connecting
the input data port of the storage element to receive data from an input multiplexer, and
configuring the input multiplexer to pass accumulator data or coefficient data.






20. A method as claimed in claim 18 or 19, further comprising the step of
providing an index multiplexer for providing the index signal, and configuring the index
multiplexer to selectively pass accumulator data or coefficient data, or part of the received
instruction.
21. A method as claimed in claim 19 or 20, further comprising the step of
controlling the input multiplexer and/or index multiplexer with the received instruction.

22. A method as claimed in any one of claims 17 to 21, further comprising the
step of configuring the storage element to provide the processing element with a coefficient
based on the data to be processed.

23. A method as claimed in claim 22, further comprising the step of configuring
the input multiplexer to pass accumulator data to the storage element when storing coefficient
data, the coefficient data being stored in a storage location defined by the index signal.

24. A method as claimed in claim 22, further comprising the step of configuring
the input multiplexer to pass coefficient data to the storage element, and storing the
coefficient data in a storage location defined by the index signal.

25. A method as claimed in claim 23 or 24, wherein the index signal is defined by 
coefficient data received by the index multiplexer. 

26. A method as claimed in claim 23 or 24, wherein the index signal is defined by
accumulator data received by the index multiplexer.

27. A method as claimed in any one of claims 17 to 26, further comprising the
step of configuring the storage element to provide a local look-up table for the processing
element.

28. A method as claimed in claim 27, wherein the input multiplexer is configured
to pass coefficient data to the storage element for storage in a location defined by the index
signal.




29. A method as claimed in claim 28, wherein the index signal is defined by
accumulator data received by the index multiplexer.
30. A method as claimed in claim 27, further comprising the step of configuring
the input multiplexer to pass a first part of the coefficient data as the data to be stored in the
storage element, and arranging the index multiplexer to pass the other part of the coefficient
data as the index signal defining the storage address.

31. A method as claimed in any one of claims 17 to 30, further comprising the
step of providing a register for storing data between the output of the storage element and the
input of the multiplexer.






32. A method as claimed in any one of claims 17 to 31, wherein the processing 

array is a single instruction multiple data (SIMD) processing array. 
ABSTRACT:



A processing element 1 forming part of a parallel processing array such as
SIMD comprises an arithmetic logic unit (ALU) 3, a multiplexer (MUX) 5, an accumulator
(ACCU) 7 and a flag register (FLAG) 9. The ALU is configured to operate on a common
instruction received by all processing elements in the processing array. The processing
element 1 further comprises a storage element (SE) 11, which supports the processing of
local customized (i.e. data dependent) processing in the processing element 1, such as
look-up table operations and the storing of local coefficient data.



Fig. 2 